Topic Modelling on The NSF Research Awards Abstracts dataset using BERTopic¶

In this notebook, I'll apply topic modelling techniques using text embeddings with BERTopic. Topic modelling uses a corpus of text to identify clusters of documents that share common themes or similar words.

Traditionally, for topic modelling, we would use techniques like LDA (Latent Dirichlet Allocation) or LSA (Latent Semantic Analysis). However, in this notebook, I will use embeddings generated by models based on transformers to get a dense representation of the meaning of each text. In this case, each text represents one abstract from the NSF Research Awards Abstracts dataset.
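To build intuition for why dense embeddings help here: texts with similar meaning map to nearby vectors, which can be compared with cosine similarity. A toy numpy sketch, with hand-made 3-dimensional vectors standing in for real model embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors; in practice a transformer model produces these
vec_covid = np.array([0.9, 0.1, 0.0])
vec_pandemic = np.array([0.8, 0.2, 0.1])
vec_geometry = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(vec_covid, vec_pandemic))  # high: related meanings
print(cosine_similarity(vec_covid, vec_geometry))  # low: unrelated meanings
```

BERTopic clusters these dense vectors, so abstracts about similar research end up in the same topic even when they use different words.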

I'll start by reading the data that was prepared by another notebook.

In [1]:
import pandas as pd
import os

abstracts_df = pd.read_csv(os.path.join('data', 'processed', 'abstracts.csv'))
# https://www.nsf.gov/awardsearch/showAward?AWD_ID=2053734&HistoricalAwards=false
abstracts_df.dropna(subset=['award_id', 'abstract'], inplace=True)

Train the topic model¶

I'll train the topic model using BERTopic, changing the default value of min_topic_size. I increased it to 50 to get more interesting topics with more abstracts in each of them. We use a vectorizer to remove stop words after computing the embeddings and finding the topics, as advised in the Tips & Tricks section of the BERTopic documentation.

In [2]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(verbose=True, min_topic_size=50, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(abstracts_df['abstract'].tolist())
num_topics = len(topic_model.get_topic_info())
print(f'# of topics discovered: {num_topics}')
Batches:   0%|          | 0/412 [00:00<?, ?it/s]
2023-10-06 15:14:16,447 - BERTopic - Transformed documents to Embeddings
2023-10-06 15:14:34,508 - BERTopic - Reduced dimensionality
2023-10-06 15:14:34,897 - BERTopic - Clustered reduced embeddings
# of topics discovered: 45
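A useful next step, not shown in the cell above, is to attach the topic assignments returned by fit_transform back to the dataframe so each abstract carries its topic id. A minimal pandas sketch with mock data (in the notebook, `topics` is the list returned by fit_transform):

```python
import pandas as pd

# Mock data standing in for the real abstracts and fit_transform output
abstracts_df = pd.DataFrame({'abstract': ['covid study', 'plant species', 'stem outreach']})
topics = [0, 1, -1]  # -1 marks outlier documents

abstracts_df['topic'] = topics
counts = abstracts_df['topic'].value_counts()  # mirrors the Count column of get_topic_info()
print(counts)
```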

Topic representation¶

I discovered 45 topics. Let's analyze the top 8 of them. According to the documentation, we can ignore Topic -1, which contains the outliers.

In [3]:
topic_model.get_topic_info().head(9)
Out[3]:
   Topic  Count  Name
0     -1   5224  -1_research_project_using_students
1      0    748  0_covid19_social_pandemic_health
2      1    593  1_species_plant_research_project
3      2    592  2_physics_stars_universe_matter
4      3    518  3_theory_geometry_equations_algebraic
5      4    400  4_stem_engineering_learning_education
6      5    383  5_ice_climate_ocean_sea
7      6    347  6_cells_cell_proteins_protein
8      7    334  7_mantle_seismic_earthquakes_subduction
In [4]:
topic_model.visualize_barchart(top_n_topics=8, n_words=10, height=700)

From the last visualization, I can identify 8 different topics:

0) COVID-19 pandemic
1) Botany/Ecology
2) Astrophysics
3) Mathematics
4) STEM education
5) Climate change
6) Microbiology/Biochemistry
7) Geology (seismic and volcanic activity)

Topic relationships¶

I'd like to check the uniqueness of each topic. Some topics can be similar to others; if they are, we could merge them so we can explore them further. To do this, we use a 2D representation of the topics via UMAP.

In [5]:
topic_model.visualize_topics(top_n_topics=num_topics)
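For intuition about what the intertopic map computes: each topic has its own embedding (BERTopic stores them in topic_model.topic_embeddings_), and the map projects those vectors down to 2D so nearby points mean similar topics. BERTopic uses UMAP for the projection; the sketch below uses PCA on random stand-in embeddings just to illustrate the shape of the operation:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-in for the 45 topic embeddings in topic_model.topic_embeddings_
topic_embeddings = rng.normal(size=(45, 384))

# BERTopic's intertopic map uses UMAP; PCA is a simpler stand-in here
coords = PCA(n_components=2).fit_transform(topic_embeddings)
print(coords.shape)  # one (x, y) point per topic
```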

Another way of visualizing the relationships among topics is through the hierarchy generated by the HDBSCAN algorithm used to create the topics.

In [6]:
topic_model.visualize_hierarchy(top_n_topics=num_topics, width=1000)
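Under the hood, a dendrogram like this comes from agglomerative clustering over the topic representations: the closest pair of topics is merged first, then the next closest, and so on. A scipy sketch of the same idea on random stand-in embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
topic_embeddings = rng.normal(size=(8, 16))  # stand-in for 8 topic vectors

Z = linkage(topic_embeddings, method='ward')       # build the merge hierarchy
clusters = fcluster(Z, t=2, criterion='maxclust')  # cut it into 2 branches
print(clusters)
```

Cutting the dendrogram at different heights gives coarser or finer topic groupings, which is one way to decide which topics are candidates for merging.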

References¶

  • Topic Modeling arXiv Abstract with BERTopic
  • Tips & Tricks - BERTopic